May 23, 2017
In May 2015, Science retracted a study, published just five months earlier, of how canvassers can sway people's opinions about gay marriage.
Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results. + Survey incentives misrepresented. + Sponsorship statement false.
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we'll discuss today can't prevent this, but they can make it easier to discover issues.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
Original conclusion: The risk of divorce in a heterosexual marriage increases when the wife falls ill, but not the husband.
Corrected conclusion: Based on the corrected analysis, we conclude that there are not gender differences in the relationship between gender, pooled illness onset, and divorce.
"The research environment is fast-paced given the ethos to “publish or perish.”"
"[…] research is becoming increasingly complex, with greater calls for transdisciplinary collaborations, “big data,” and more sophisticated research questions and methods […] data sets often have multiple files that require merging, change the wording of questions over time, provide incomplete codebooks, and have unclear and sometimes duplicative variables."
"Given these issues, I would not be surprised if coding errors were fairly common […]"
Source: Karl Broman
Your closest collaborator is you six months ago,
but you don’t reply to emails.
- Mark Holder
We need an environment where
data, analysis, and results are tightly connected, or better yet, inseparable
documentation is human readable and syntax is minimal
Scriptability \(\rightarrow\) R
Literate programming \(\rightarrow\) R Markdown
Version control \(\rightarrow\) Git / GitHub
Learning curve: Point-and-click software (supposedly) has a shallower learning curve than scripting languages
Automation: Need to rerun your analysis with new/updated data? Just change the input file.
Collaboration: Sharing your analysis is as easy as sharing your scripts
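A minimal sketch of the automation point: if the analysis is a function of the input file path, rerunning it on updated data means changing one argument. The function name `run_analysis`, the `age` column, and the temporary CSV files are hypothetical and only for illustration.

```r
# Hypothetical analysis step: read a data file and summarise one column.
run_analysis <- function(input_file) {
  dat <- read.csv(input_file)
  c(n = nrow(dat), mean_age = mean(dat$age, na.rm = TRUE))
}

# Two toy input files standing in for "original" and "updated" data.
old_file <- tempfile(fileext = ".csv")
new_file <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(20, 30)), old_file, row.names = FALSE)
write.csv(data.frame(age = c(20, 30, 40)), new_file, row.names = FALSE)

# Rerunning the analysis only requires pointing at the new file.
result_old <- run_analysis(old_file)
result_new <- run_analysis(new_file)
```

The same script, shared with a collaborator, reproduces both results without any manual steps.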
There are a number of other great programming tools out there that can also be used to improve the reproducibility of your analysis
The key is to use some type of language that will allow you to automate and document your analysis
Once you master one language you'll probably find it easier to learn another
You could just type into the command prompt, but that doesn't help much with
or
"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."
"The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other."
With RStudio you can combine your programming and your documentation
Markdown is a lightweight markup language for creating HTML (or XHTML) documents.
Markup languages are designed to produce documents from human readable text (and annotations).
Some of you may be familiar with LaTeX, another (less human-friendly) markup language, used for creating PDF documents.
Well, it's R + Markdown:
Ease of Markdown syntax
Rendering of R code to produce output and plots
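To make the two ingredients concrete, here is a small sketch that writes a minimal R Markdown file from R: a YAML header, a sentence of Markdown, and one R chunk. The file name and its contents are made up for illustration, and `cars` is a dataset that ships with R.

```r
# Build the chunk fence from single backticks so this example embeds cleanly.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Minimal example\"",
  "output: html_document",
  "---",
  "",
  "Plain *Markdown* text explains what the code below does.",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)  # cars is a dataset that ships with R",
  fence
)

# Write the document to a temporary location and print it back.
rmd_file <- file.path(tempdir(), "minimal-example.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```

If the rmarkdown package and pandoc are installed, `rmarkdown::render(rmd_file)` should turn this file into an HTML page with the text, the code, and the code's output together.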
The Big Five personality traits is a theory of five broad dimensions used by some psychologists to describe the human personality and psyche: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism.
Load data with an R chunk:
library(dplyr) # provides %>% and as_tibble()
big5 <- read.delim("raw-data/big5.txt") %>%
  as_tibble() # for formatting; replaces the deprecated tbl_df()
Sources: Wikipedia and http://personality-testing.info/_rawdata/.
big5
## # A tibble: 19,719 × 57
##     race   age engnat gender  hand source country    E1    E2    E3    E4
##    <int> <int>  <int>  <int> <int>  <int>  <fctr> <int> <int> <int> <int>
## 1      3    53      1      1     1      1      US     4     2     5     2
## 2     13    46      1      2     1      1      US     2     2     3     3
## 3      1    14      2      2     1      1      PK     5     1     1     4
## 4      3    19      2      2     1      1      RO     2     5     2     4
## 5     11    25      2      2     1      2      US     3     1     3     3
## 6     13    31      1      2     1      2      US     1     5     2     4
## 7      5    20      1      2     1      5      US     5     1     5     1
## 8      4    23      2      1     1      2      IN     4     3     5     3
## 9      5    39      1      2     3      4      US     3     1     5     1
## 10     3    18      1      2     1      5      US     1     4     2     5
## # ... with 19,709 more rows, and 46 more variables: E5 <int>, E6 <int>,
## #   E7 <int>, E8 <int>, E9 <int>, E10 <int>, N1 <int>, N2 <int>, N3 <int>,
## #   N4 <int>, N5 <int>, N6 <int>, N7 <int>, N8 <int>, N9 <int>, N10 <int>,
## #   A1 <int>, A2 <int>, A3 <int>, A4 <int>, A5 <int>, A6 <int>, A7 <int>,
## #   A8 <int>, A9 <int>, A10 <int>, C1 <int>, C2 <int>, C3 <int>, C4 <int>,
## #   C5 <int>, C6 <int>, C7 <int>, C8 <int>, C9 <int>, C10 <int>, O1 <int>,
## #   O2 <int>, O3 <int>, O4 <int>, O5 <int>, O6 <int>, O7 <int>, O8 <int>,
## #   O9 <int>, O10 <int>
You can include script files in your R Markdown document
source("code/01-data-cleanup.R")
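A small sketch of what `source()` does, using a throwaway script instead of the real cleanup file, whose contents are not shown here. The script text, the `toy_raw` data, and the age cutoff are all hypothetical.

```r
# Write a tiny stand-in script to a temporary file (hypothetical contents).
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Hypothetical cleanup step: drop rows with implausible ages",
  "toy_clean <- subset(toy_raw, age <= 100)"
), script_file)

toy_raw <- data.frame(age = c(14, 25, 230))

# source() runs the script; local = TRUE evaluates it in the calling
# environment, so the script sees toy_raw and leaves toy_clean behind.
source(script_file, local = TRUE)
nrow(toy_clean)
```

Keeping cleanup in its own script means the document stays short while every step from raw data to results is still recorded in code.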
ggplot(big5, aes(x = age)) + geom_histogram()
summary(big5$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   13.00   18.00   22.00   26.26   31.00   99.00
Extraversion: Seeking fulfillment from sources outside the self or in community. High scorers are social, low scorers prefer to work alone. Neuroticism: Being emotional.
m_ext_age <- lm(extraversion ~ neuroticism * gender, data = big5)
summary(m_ext_age)
##
## Call:
## lm(formula = extraversion ~ neuroticism * gender, data = big5)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25.3125  -6.3391   0.0132   6.6079  26.0924
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             15.202758   0.190240  79.913  < 2e-16
## neuroticism              0.297346   0.009615  30.925  < 2e-16
## genderMale              -1.893017   0.327308  -5.784 7.42e-09
## genderOther             -5.721794   2.177580  -2.628  0.00861
## neuroticism:genderMale   0.001576   0.015226   0.104  0.91755
## neuroticism:genderOther -0.008332   0.125205  -0.067  0.94694
##
## Residual standard error: 8.854 on 19605 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.08003, Adjusted R-squared:  0.0798
## F-statistic: 341.1 on 5 and 19605 DF,  p-value: < 2.2e-16
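For reading a coefficient table like this one: in an R formula, `x * g` expands to the main effects plus the interaction, and one factor level is absorbed into the intercept as the baseline. The sketch below shows this on simulated data. The variable names, the level labels, and the numbers are invented for illustration and are not the Big Five data.

```r
set.seed(1)

# Simulated stand-in data: one numeric predictor, one three-level factor.
toy <- data.frame(
  x = rnorm(60),
  g = factor(rep(c("Female", "Male", "Other"), each = 20))
)
toy$y <- 1 + 0.5 * toy$x + rnorm(60)

# The * operator is shorthand for main effects plus interaction.
m_star <- lm(y ~ x * g, data = toy)
m_long <- lm(y ~ x + g + x:g, data = toy)

# The first factor level ("Female" here) is the baseline, so it gets no
# coefficient of its own; the other levels are offsets from it.
names(coef(m_star))
all.equal(coef(m_star), coef(m_long))
```

Each `x:g...` term is the difference in slope between that level and the baseline level, which is how the interaction rows in the output are meant to be read.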
ggplot(data = big5, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) + # jitter only; adding geom_point() as well draws every point twice
  geom_smooth(method = "lm")
big5_teen <- filter(big5, age <= 19)
m_ext_age_teen <- lm(extraversion ~ age * gender, data = big5_teen)
summary(m_ext_age_teen)
##
## Call:
## lm(formula = extraversion ~ age * gender, data = big5_teen)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19.8426  -6.9399   0.0037   7.0601  22.6662
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     14.12536    1.43788   9.824  < 2e-16
## age              0.30091    0.08502   3.539 0.000404
## genderMale       6.78702    2.47559   2.742 0.006131
## genderOther      6.66006   11.01228   0.605 0.545342
## age:genderMale  -0.42066    0.14590  -2.883 0.003949
## age:genderOther -0.76174    0.66364  -1.148 0.251085
##
## Residual standard error: 9.366 on 6740 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.005666, Adjusted R-squared:  0.004929
## F-statistic: 7.681 on 5 and 6740 DF,  p-value: 3.274e-07
ggplot(data = big5_teen, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) + # jitter only; adding geom_point() as well draws every point twice
  geom_smooth(method = "lm")
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.
Source: Piled Higher and Deeper by Jorge Cham, http://www.phdcomics.com.
2013-10-14_manuscriptFish.doc
2013-10-30_manuscriptFish.doc
2013-11-05_manusctiptFish_intitialRyanEdits.doc
2013-11-10_manuscriptFish.doc
2013-11-11_manuscriptFish.doc
2013-11-15_manuscriptFish.doc
2013-11-30_manuscriptFish.doc
2013-12-01_manuscriptFish.doc
2013-12-02_manuscriptFish_PNASsubmitted.doc
2014-01-03_manuscriptFish_PLOSsubmitted.doc
2014-02-15_manuscriptFish_PLOSrevision.doc
2014-03-14_manuscriptFish_PLOSpublished.doc
Every time you save, you zip the entire directory that holds your project files and save the archive with a date.
Source: https://github.com/mine-cetinkaya-rundel/2016-01-11-reproducible-research-unc/.
Start with a base version of the document, then save just the changes you make at each step of the way.
Think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.
Source: Software Carpentry.
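The tape analogy can be sketched in a few lines of R: keep a base version plus an ordered list of changes, and replay as many changes as you like to recover any version. This is only a toy model of the idea, with made-up document contents and a hypothetical `replay` helper. It is not a description of how Git stores data internally.

```r
# Base version of a toy "document", stored as a character vector of lines.
base_doc <- c("Title", "Methods", "Results")

# Each saved change is a small function that edits the previous version.
changes <- list(
  function(doc) append(doc, "Introduction", after = 1),
  function(doc) replace(doc, doc == "Results", "Results (revised)"),
  function(doc) c(doc, "Discussion")
)

# Replay the first n changes, in order, on top of the base version.
replay <- function(base, changes, n = length(changes)) {
  Reduce(function(doc, change) change(doc), changes[seq_len(n)], base)
}

latest   <- replay(base_doc, changes)         # play the whole tape
version1 <- replay(base_doc, changes, n = 1)  # stop after the first change
```

Rewinding is just replaying fewer changes, so no version is ever lost and no dated copies of the whole file are needed.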
Everyone struggles with reproducibility and it is a hindrance to moving science forward.
#1 Adopt a reproducible research workflow
#2 Train new researchers who don’t have any other workflow
two prongs
R Markdown: http://rmarkdown.rstudio.com/
RStudio and git: https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN
Try Git: https://try.github.io
Reproducible Science Curriculum (2 day workshop): https://github.com/Reproducible-Science-Curriculum/
\(\hat{y} = \beta_0 + \beta_1 \times x\)